Relationship-Based Clustering and Visualization for High-Dimensional Data Mining

Authors

  • Alexander Strehl
  • Joydeep Ghosh
Abstract

In several real-life data-mining applications, data reside in very high (1000 or more) dimensional space, where both clustering techniques developed for low-dimensional spaces (k-means, BIRCH, CLARANS, CURE, DBScan, etc.) as well as visualization methods such as parallel coordinates or projective visualizations, are rendered ineffective. This paper proposes a relationship-based approach that alleviates both problems, side-stepping the “curse-of-dimensionality” issue by working in a suitable similarity space instead of the original high-dimensional attribute space. This intermediary similarity space can be suitably tailored to satisfy business criteria such as requiring customer clusters to represent comparable amounts of revenue. We apply efficient and scalable graph-partitioning-based clustering techniques in this space. The output from the clustering algorithm is used to re-order the data points so that the resulting permuted similarity matrix can be readily visualized in two dimensions, with clusters showing up as bands. While two-dimensional visualization of a similarity matrix is by itself not novel, its combination with the order-sensitive partitioning of a graph that captures the relevant similarity measure between objects provides three powerful properties: (i) the high-dimensionality of the data does not affect further processing once the similarity space is formed; (ii) it leads to clusters of (approximately) equal importance, and (iii) related clusters show up adjacent to one another, further facilitating the visualization of results. The visualization is very helpful for assessing and improving clustering. For example, actionable recommendations for splitting or merging of clusters can be easily derived, and it also guides the user toward the right number of clusters. Results are presented on a real retail industry dataset of several thousand customers and products, as well as on clustering of web-document collections and of web-log sessions.
(Cluster Analysis; Graph Partitioning; High Dimensional; Visualization; Retail Customers; Text Mining; Web-Log Analysis)
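
The re-ordering step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the graph-partitioning step is not reproduced, the cluster labels are assumed to be given by some prior clustering, cosine similarity is only one assumed choice of similarity measure, and the function names `cosine_similarity_matrix` and `reorder_by_cluster` are hypothetical.

```python
import numpy as np

def cosine_similarity_matrix(X):
    """Pairwise cosine similarity between the rows of X (an assumed similarity measure)."""
    X = np.asarray(X, dtype=float)
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    norms[norms == 0.0] = 1.0  # avoid division by zero for all-zero rows
    Xn = X / norms
    return Xn @ Xn.T

def reorder_by_cluster(S, labels):
    """Permute rows and columns of S so objects with the same label are contiguous."""
    labels = np.asarray(labels)
    order = np.argsort(labels, kind="stable")  # stable sort keeps within-cluster order
    return S[np.ix_(order, order)], order

# Toy data: 4 objects with 3 attributes each.
X = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [2.0, 0.0, 0.0],
              [0.0, 3.0, 0.0]])
# Labels are assumed to come from any clustering step (not computed here).
labels = np.array([0, 1, 0, 1])

S = cosine_similarity_matrix(X)
S_perm, order = reorder_by_cluster(S, labels)
```

In the permuted matrix, objects of the same cluster sit next to each other, so high within-cluster similarity appears as contiguous regions when `S_perm` is rendered as an image. Because all later steps operate on the n-by-n matrix `S`, the number of original attributes no longer matters once `S` is formed.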

Similar resources

High-Dimensional Unsupervised Active Learning Method

In this work, a hierarchical ensemble of projected clustering algorithm for high-dimensional data is proposed. The basic concept of the algorithm is based on the active learning method (ALM) which is a fuzzy learning scheme, inspired by some behavioral features of human brain functionality. High-dimensional unsupervised active learning method (HUALM) is a clustering algorithm which blurs the da...

Relationship-based Visualization of High-dimensional Data Clusters

In several real-life data mining applications, data resides in very high (> 1000) dimensional space, where both clustering techniques developed for low dimensional spaces (k-means, BIRCH, CLARANS, CURE, DBScan etc) as well as visualization methods such as parallel coordinates or projective visualizations, are rendered ineffective. This paper proposes a relationship based approach to clustering ...

Visual Cluster Analysis in Data Mining

Clustering is a major technique in data mining. However the numerical feedback of clustering algorithms is difficult for user to have an intuitive overview of the dataset that they deal with. Visualization has been proven to be very helpful for high-dimensional data analysis. Therefore it is desirable to introduce visualization techniques with user’s domain knowledge into cluste...

Customer behavior mining based on RFM model to improve the customer relationship management

Companies’ managers are very enthusiastic to extract the hidden and valuable knowledge from their organization data. Data mining is a new and well-known technique, which can be implemented on customers' data and discover the hidden knowledge and information from customers' behaviors. Organizations use data mining to improve their customer relationship management processes. In this paper R, F, an...

CUSTOMER CLUSTERING BASED ON FACTORS OF CUSTOMER LIFETIME VALUE WITH DATA MINING TECHNIQUE

Organizations have used Customer Lifetime Value (CLV) as an appropriate pattern to classify their customers. Data mining techniques have enabled organizations to analyze their customers’ behaviors more quantitatively. This research has been carried out to cluster customers based on factors of CLV model including length, recency, frequency, and monetary (LRFM) through data mining. Based on LRFM,...

Journal:
  • INFORMS Journal on Computing

Volume 15

Publication date: 2003